Abstract

Humans and other animals can adapt their social behavior in response to environmen- 
tal cues including the feedback obtained through experience. Nevertheless, the effects of 
the experience-based learning of players in evolution and maintenance of cooperation in 
social dilemma games remain relatively unclear. Some previous literature showed that 
mutual cooperation of learning players is difficult or requires a sophisticated learning 
model. In the context of the iterated Prisoner's Dilemma, we numerically examine the 
performance of a reinforcement learning model. Our model modifies those of Karandikar 
et al. (1998), Posch et al. (1999), and Macy and Flache (2002) in which players satisfice if 
the obtained payoff is larger than a dynamic threshold. We show that players obeying the 
modified learning mutually cooperate with high probability if the dynamics of threshold 
is not too fast and the association between the reinforcement signal and the action in 
the next round is sufficiently strong. The learning players also perform efficiently against 
the reactive strategy. In evolutionary dynamics, they can invade a population of players 
adopting simpler but competitive strategies. Our version of the reinforcement learning 
model does not complicate the previous model and is sufficiently simple yet flexible. It
may serve to explore the relationships between learning and evolution in social dilemma 
situations. 

Keywords: cooperation, direct reciprocity, Prisoner's Dilemma, reinforcement learning
1 Introduction



Human beings and other animals often cooperate with each other even in social dilemma sit- 
uations where to not cooperate is apparently a rational choice. A standard framework in 
which social dilemma situations are studied is the Prisoner's Dilemma game (PD) and its 
variants. Many theoretical mechanisms for emergence and maintenance of cooperation in 
social dilemma games have been reported thus far (Axelrod, 1984; Boyd & Richerson, 1985;
Nowak, 2006; Sigmund, 2010).



Most of these mechanisms do not deal with the adaptation or learning of individuals. We 
use the term learning to refer to individual learning (i.e., experience-based adaptation), but 
not to social learning (i.e., imitation). Learning implies that an individual takes advantage 
of the history of the games that it has played to perform better in subsequent rounds. A 
learning individual changes behavior on the basis of some statistics of the game results.
Laboratory experiments suggest that humans do learn during sequences of games (Camerer, 2003;
Glimcher et al., 2009). The learning of the social behavior of animals, including humans,
has been modeled in various game and non-game situations (Rapoport & Chammah, 1965;
Cross, 1983; Boyd & Richerson, 1985; Fudenberg & Levine, 1998; Camerer, 2003).



Learning in a game is relevant only in an iterated game. It is well known that mutual
cooperation can be optimal in the iterated PD (Trivers, 1971; Axelrod, 1984). Action rules that
have mainly been considered in the context of the iterated PD are those that do not adjust
conditional probabilities of cooperation upon experience. A player using a look-up table that
relates the next action to the outcome of the game in the current and past few rounds belongs
to this class (Axelrod, 1984; Kraines & Kraines, 1989; Nowak & Sigmund, 1989;
Nowak & Sigmund, 1990; Nowak, 1990; Lindgren, 1991; Nowak & Sigmund, 1992;
Nowak & Sigmund, 1993; Nowak, 2006; Sigmund, 2010). This important class includes well-known
strategies such as tit-for-tat (TFT). However, the flexibility of such a strategy appears to
be limited.

Players using reinforcement learning, on which we focus in this study, exploit information 
about past encounters to adapt the probability of cooperation conditioned by the outcome 
of the game in a couple of past rounds. Because of their flexibility, such learning players 
may be strong competitors in the iterated PD. If learning players compete relatively well in a 
population, the learning behavior may spread to become dominant in the population through 
evolutionary dynamics. Nevertheless, the possible roles of reinforcement learning in the
iterated PD, either in favor of or against the promotion of cooperation, are relatively
unexplored. In fact, players using reinforcement learning have generally been unsuccessful in
the PD and other social dilemma games (Macy, 1996; Sandholm & Crites, 1996; Posch et al., 1999;
Taiji & Ikegami, 1999; Macy & Flache, 2002; Masuda & Ohtsuki, 2009). Although an artificial
neural network model, for example, enables mutual cooperation (Gutnisky & Zanutto, 2004),
such a complicated mechanism may not be implemented by humans or other animals. It seems
that the current understanding of social dilemmas is mostly based on studies in the fields of
evolutionary biology and economics. Because experience-based learning, and reinforcement
learning in particular, is quite evident in humans and other animals, both in terms of behavior
and neural activities (Camerer, 2003; Glimcher et al., 2009), clarifying the role of
reinforcement learning in the iterated PD may provide an additional understanding of how
subjects cope with social dilemmas.

In the present study, we numerically examine a variant of the reinforcement learning model
(Karandikar et al., 1998; Posch et al., 1999; Macy & Flache, 2002) in the iterated PD.
Following Macy and Flache (2002), we call the original model the Bush-Mosteller (BM) model. A
player obeying the BM reinforcement learning (BM player for short) would continue an action
(i.e., cooperate or defect) after gaining a relatively large payoff and would switch the action
otherwise. If the threshold payoff above which the player satisfices, which is called the
aspiration level, is fixed, BM players can mutually cooperate (Rapoport & Chammah, 1965;
Macy, 1991; Macy, 1996; Posch et al., 1999; Macy & Flache, 2002; Izquierdo et al., 2007;
Izquierdo et al., 2008). The BM player with the fixed aspiration level studied in these
articles is essentially the same as Pavlov, which only uses the information about the immediate
past (Kraines & Kraines, 1989; Nowak & Sigmund, 1993). Pavlov is known to be exploited by the
unconditional defector and to behave too generously toward the unconditional cooperator.

Real subjects may adapt the aspiration level in response to the results of the game
(Simon, 1959). The BM model with the adaptive aspiration level is not known to yield a large
probability of mutual cooperation except in some limited cases (Karandikar et al., 1998;
Posch et al., 1999; Macy & Flache, 2002). We remark that the performance of other
reinforcement learning models with the adaptive aspiration level has also been investigated in
the PD and other games (Pazgal, 1997; Kim, 1999; Palomino & Vega-Redondo, 1999; Dixon, 2000;
Borgers & Sarin, 2000; Oechssler, 2002; Bendor et al., 2003; Napel, 2003; Cho & Matsui, 2005).
In temporal difference learning, which is a dominant form of reinforcement learning in the
brain, dopamine neurons represent the difference between the obtained reward and the dynamic
expected reward that changes according to the subject's experience (Schultz et al., 1997;
Montague & Berns, 2002; Daw & Doya, 2006; Glimcher et al., 2009). The reinforcement signal in
the BM model with the adaptive aspiration level is given by the difference between the obtained
reward and the dynamically changing aspiration level, so the BM model with the adaptive
aspiration level is at least loosely connected to neural evidence.

We show that a simple modification of the BM model with the adaptive aspiration level 
drastically changes the behavior of the player. The modified BM player mutually cooperates 
with a large probability and is competitive in evolutionary dynamics. The modification is 
done such that the reinforcement signal is reflected fairly strongly in the action selection
in the next round. The aspiration level must adapt with a low to intermediate learning rate to
sustain cooperation. It should be noted that our modification to the BM model does not
introduce additional complexity to the original BM model with the adaptive aspiration
level (Karandikar et al., 1998; Posch et al., 1999; Macy & Flache, 2002).



2 Model 



We consider the symmetric two-person PD whose payoff matrix is given by

\[
\begin{array}{c|cc}
  & C & D \\ \hline
C & R & S \\
D & T & P
\end{array}
\tag{1}
\]



where T > R > P > S and R > (T + S)/2. The entries of Eq. (1) represent the payoffs that
the row player gains. Each row (column) corresponds to the action of the row (column) player,
i.e., cooperation (C) or defection (D). Because T > R and P > S, mutual defection is the
only Nash equilibrium of the single-shot game. Unless otherwise stated, we assume a standard
payoff matrix for the PD given by R = 3, T = 5, S = 0, and P = 1.
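
As a quick sanity check, the two conditions stated for Eq. (1) can be verified for the standard payoff values. The following Python sketch is illustrative only; the helper name `is_prisoners_dilemma` is hypothetical and is not part of the model description.

```python
def is_prisoners_dilemma(R, T, S, P):
    """Return True if T > R > P > S and R > (T + S) / 2 both hold."""
    return T > R > P > S and R > (T + S) / 2


# Standard payoff values used in the text.
standard = dict(R=3, T=5, S=0, P=1)
print(is_prisoners_dilemma(**standard))
```

The second condition excludes payoff values for which alternating between (C, D) and (D, C) would pay more on average than mutual cooperation.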

A pair of players play the PD for a predetermined number of rounds denoted by t_max.
We denote the round by t (= 1, 2, ...). Although the Nash equilibrium of the iterated PD
is defection in all the rounds, which can be derived by backward induction, we assume for
simplicity that players do not carry out backward induction. We could avoid this technical
subtlety by assuming that a next round occurs with a certain probability such that the last
round is not known beforehand (Axelrod, 1984; Nowak, 2006).

To model a learning player, we use a variant of the BM reinforcement learning model adapted 
to the game situation, pioneered in Rapoport and Chammah (1965). Our model is a variant of 
the BM model with the adaptive aspiration level (Karandikar et al., 1998; Posch et al., 1999;
Macy & Flache, 2002).

In round t, the BM player intends to cooperate with probability p_t. We set the initial
condition to p_1 = 0.5. In addition, we assume that the player misimplements the action (i.e.,
C or D) to play the opposite action with a small probability e. The payoff that the BM player
gains in round t is denoted as r_t ∈ {R, T, S, P}. We define the stimulus, or the reinforcement
signal, using the sigmoidal function as

\[ s_t = \tanh\left[\beta (r_t - A_t)\right], \tag{2} \]

where A_t is the aspiration level in round t above which the BM player satisfices. The degree
of satisfaction is parametrized by s_t, and -1 < s_t < 1 holds true. If s_t > 0 (s_t < 0), the BM
player is motivated to keep (switch) the current action in the next round. The sensitivity of
the stimulus to the reinforcement signal r_t - A_t is parametrized by β > 0.
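The dynamics of the probability of cooperation are given by

\[
p_{t+1} =
\begin{cases}
p_t + (1 - p_t) s_t, & \text{(action in round } t = \mathrm{C} \text{, and } s_t \ge 0), \\
p_t + p_t s_t, & \text{(action in round } t = \mathrm{C} \text{, and } s_t < 0), \\
p_t - p_t s_t, & \text{(action in round } t = \mathrm{D} \text{, and } s_t \ge 0), \\
p_t - (1 - p_t) s_t, & \text{(action in round } t = \mathrm{D} \text{, and } s_t < 0).
\end{cases}
\tag{3}
\]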



Finally, the dynamics of the aspiration level are given by 

\[ A_{t+1} = (1 - h) A_t + h r_t, \tag{4} \]

where h represents the learning rate of the aspiration level, which is also called
habituation (Macy & Flache, 2002). In contrast to previous models in which h decays as t
increases (Erev & Roth, 1998; Cho & Matsui, 2005), we assume that h is a fixed constant. Unless
otherwise stated, we set the initial value of A_t to A_1 = (R + T + S + P)/4, which is equal
to the expected payoff when there are an equal number of cooperators and defectors in a
population. As a remark, the possibility of cooperation in the iterated PD and other games
was examined when the update of A_t is driven by the average payoff over time (Kim, 1999;
Cho & Matsui, 2005), the maximal experienced payoffs (Pazgal, 1997), or the payoff averaged
over the population (Palomino & Vega-Redondo, 1999; Oechssler, 2002).



The difference between our model and the Macy-Flache model (Macy & Flache, 2002) lies
in Eq. (2). Macy and Flache use s_t = ℓ(r_t - A_t)/max[T - A_t, A_t - S] instead of Eq. (2). As
described below, this difference results in a remarkable difference in the behavior of the player.
In other words, we show that reacting strongly to the play in the previous round (i.e., large β) is
necessary for mutual cooperation. A deterministic decision maker with the adaptive aspiration
level used in Posch et al. (1999) corresponds to β = ∞. We numerically show that β does
not have to be extremely large for mutual cooperation. We remark that, if β = ∞ and the
aspiration level is fixed (i.e., h = 0), the strategy is a win-stay lose-shift one. In particular, our
BM model with β = ∞ and h = 0 is equivalent to the Pavlov strategy (Kraines & Kraines, 1989;
Nowak & Sigmund, 1993) if P < A_t < R.
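
The update rules in Eqs. (2)-(4) can be written compactly in code. The following Python sketch is a minimal illustration under our reading of the equations; the function names are hypothetical, and the case s_t = 0 is grouped with s_t > 0 (in that case p_t is unchanged either way).

```python
import math


def stimulus(r, A, beta):
    """Eq. (2): reinforcement signal s_t = tanh(beta * (r_t - A_t))."""
    return math.tanh(beta * (r - A))


def update_probability(p, cooperated, s):
    """Eq. (3): probability of intending to cooperate in the next round.

    A positive stimulus reinforces the action just taken; a negative
    stimulus shifts the probability toward the opposite action.
    """
    if cooperated:
        return p + (1 - p) * s if s >= 0 else p + p * s
    return p - p * s if s >= 0 else p - (1 - p) * s


def update_aspiration(A, r, h):
    """Eq. (4): the aspiration level moves toward the payoff at rate h."""
    return (1 - h) * A + h * r


# One round in which the player cooperated and was exploited (payoff S).
R, T, S, P = 3, 5, 0, 1
p, A = 0.5, (R + T + S + P) / 4      # initial conditions from the text
s = stimulus(S, A, beta=3.0)         # strongly negative stimulus
p_next = update_probability(p, cooperated=True, s=s)
A_next = update_aspiration(A, S, h=0.3)
print(s, p_next, A_next)
```

With a large β and an aspiration level between P and R, the same functions reproduce the win-stay lose-shift behavior: the probability of cooperation is pushed toward 1 after mutual cooperation and also after mutual defection.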



3 Results 

3.1 BM versus BM 

In this section, we examine the performance of a BM player playing against another BM player. 
We assume that the two players employ the same values of β and h. For a range of β and h,
the fraction of the rounds in which the focal BM player cooperates is shown for three values
of implementation error, e = 0, 0.01, 0.1, and two values of the number of rounds, t_max = 100,
1000, in Fig. 1. The presented values are averages over 100 trials in this and the following
figures unless otherwise stated. The fraction of cooperation is large when h is small and β is
large. The results are fairly robust, despite some degradation, even under 10% of the error in
the action implementation (Fig. 1(c, f)). Remarkably, a large fraction of cooperation can be
established after only t_max = 100 rounds (Fig. 1(d, e, f)). These results are in contrast to those
for other reinforcement learning models for social dilemma games, where the establishment of
mutual cooperation requires a large number of rounds (Masuda & Ohtsuki, 2009) or is simply
difficult (Macy, 1996; Sandholm & Crites, 1996; Posch et al., 1999; Taiji & Ikegami, 1999;
Macy & Flache, 2002; Masuda & Ohtsuki, 2009).
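
The numerical experiment described above, two BM players with identical β and h playing t_max rounds, can be sketched as follows. This is an illustrative reimplementation, not the code used to produce the figures. The class and function names are hypothetical, and we assume that learning is based on the implemented (possibly erroneous) action rather than on the intended one.

```python
import math
import random

# Payoff to the first player for (own action, opponent action): R, S, T, P.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}


class BMPlayer:
    def __init__(self, beta, h, p=0.5, A=(3 + 5 + 0 + 1) / 4):
        self.beta, self.h, self.p, self.A = beta, h, p, A

    def act(self, rng, e):
        """Intend C with probability p, then flip the action with probability e."""
        cooperate = rng.random() < self.p
        if rng.random() < e:
            cooperate = not cooperate
        return "C" if cooperate else "D"

    def learn(self, action, r):
        """Apply Eqs. (2)-(4) after obtaining payoff r with the given action."""
        s = math.tanh(self.beta * (r - self.A))
        p = self.p
        if action == "C":
            p = p + (1 - p) * s if s >= 0 else p + p * s
        else:
            p = p - p * s if s >= 0 else p - (1 - p) * s
        self.p = min(1.0, max(0.0, p))  # guard against rounding error
        self.A = (1 - self.h) * self.A + self.h * r


def cooperation_fraction(beta, h, e=0.0, t_max=1000, seed=0):
    """Fraction of rounds in which the focal BM player cooperates."""
    rng = random.Random(seed)
    focal, other = BMPlayer(beta, h), BMPlayer(beta, h)
    n_coop = 0
    for _ in range(t_max):
        x, y = focal.act(rng, e), other.act(rng, e)
        focal.learn(x, PAYOFF[(x, y)])
        other.learn(y, PAYOFF[(y, x)])
        n_coop += x == "C"
    return n_coop / t_max


print(cooperation_fraction(beta=3.0, h=0.3, e=0.01))
```

Averaging `cooperation_fraction` over many seeds for a grid of β and h values would give a rough counterpart of one panel of Fig. 1, subject to the assumptions stated above.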



In Fig. 1, β must be larger than approximately 2.7 for the fraction of cooperation to be large
for small h. When β is in this range, Eq. (2) suggests that the reinforcement signal s_t would
be typically close to -1 or 1 before a possible equilibrium is reached. This is because |r_t - A_t|
is typically about unity or larger when R = 3, T = 5, S = 0, and P = 1. Then, Eq. (3) implies
that p_t is close to 0 or 1, and the selection of the action tends to be almost deterministic. This
deterministic nature of the BM player seems to pave the way to mutual cooperation. This
result is consistent with those obtained from other models of reinforcement learning with the
adaptive aspiration level (Palomino & Vega-Redondo, 1999; Oechssler, 2002).
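
To make the argument concrete, consider an illustrative case (the specific numbers are our assumption, chosen only for the example) with β = 2.7, r_t - A_t = 1, p_t = 0.5, and the action C in round t. Equations (2) and (3) then give

```latex
\[
s_t = \tanh(2.7 \times 1) \approx 0.991, \qquad
p_{t+1} = p_t + (1 - p_t)\, s_t \approx 0.5 + 0.5 \times 0.991 \approx 0.996 .
\]
```

A single round is thus sufficient to make the next action nearly deterministic.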

Some mutual cooperation also occurs in the Macy-Flache original BM model with the
adaptive aspiration level (Macy & Flache, 2002). For the sake of comparison, the fraction of
cooperation in the Macy-Flache model with e = 0 and t_max = 1000 is shown in Fig. 2 for
various values of h and the sensitivity to the stimulus ℓ. The fraction of cooperation is much
smaller than that for our model. We consider that this is because the stimulus s_t with which
to update the probability to cooperate in the next round is not sufficiently sensitive to the
reinforcement signal r_t - A_t in the Macy-Flache model. To satisfy -1 < s_t < 1 such that
Eq. (3) is well-defined, we need ℓ < 1. Then, s_t would not be close to -1 or 1 in a considerable
number of rounds, and the action in the next round is not likely to be very sensitive to the
result of the game in the current round. Regardless of the value of ℓ, the Macy-Flache model
roughly corresponds to our model with a small value of β. This interpretation is consistent
with the result that a small β yields a small fraction of cooperation in our model (Fig. 1).

Our results are also consistent with those in Posch et al. (1999), in which the authors
analytically showed that mutual cooperation is difficult when h = 1 and β = ∞ (their
YESTERDAY strategy) and that mutual cooperation is established if the temptation payoff
T is not too large when h is tiny and β = ∞ (their FARAWAY strategy). A sufficiently
small h combined with slight stochasticity in the dynamics of h also leads to mutual
cooperation (Karandikar et al., 1998). Our numerical results extend their analytical results
in showing that the BM players mutually cooperate up to an intermediate value of h if β is
sufficiently large.

To test the robustness of the results against changes in the payoff matrix, we set R = b - c,
T = b, S = -c, and P = 0, and measure the fraction of cooperation as a function of h and the
benefit-to-cost ratio b/c. We set β = 3, for which the BM player mutually cooperates when
R = 3, T = 5, S = 0, P = 1, and h is small (Fig. 1). The results for t_max = 1000, e = 0.02,
and c = 1 are shown in Fig. 3. The cooperation decreases with an increase in h. Nevertheless,
the threshold value of b/c above which the BM players mutually cooperate with a probability
close to unity differs only slightly up to h ≈ 0.25.
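
For completeness, one can check that this parametrization is a PD whenever b > c > 0, i.e., whenever b/c > 1. The following short verification is ours and uses only the definitions R = b - c, T = b, S = -c, and P = 0:

```latex
\[
T - R = c > 0, \qquad
R - P = b - c > 0, \qquad
P - S = c > 0, \qquad
R - \frac{T + S}{2} = (b - c) - \frac{b - c}{2} = \frac{b - c}{2} > 0 .
\]
```

Both conditions accompanying Eq. (1), T > R > P > S and R > (T + S)/2, therefore hold for any benefit-to-cost ratio larger than unity.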

A small h requires a relatively large number of rounds before the cooperative equilibrium is
reached, even if the parameter values are set to yield a cooperative equilibrium. The fraction
of cooperation when t_max = 1000, e = 0.02, β = 3, R = 3, T = 5, S = 0, and P = 1 is
shown in Fig. 4 for various values of h and initial aspiration level A_1. Figure 4 suggests that
h > 0.03 is necessary for the aspiration level A_t to relax to an equilibrium value within
t_max = 1000 rounds. When h is too small, the fraction of cooperation strongly depends on A_1.
If P < A_1 < R, the BM player is essentially the same as Pavlov for such a small h. In this case,
mutual cooperation is realized, reflecting the fact that Pavlov cooperates against itself
(Kraines & Kraines, 1989; Nowak & Sigmund, 1993). However, if we start from a different A_1,
the fraction of cooperation would be small for a small value of h.

Two BM players may have different parameter values. Because Fig. 4 suggests that the
value of A_1 is irrelevant to the fraction of cooperation unless h is too small, we set A_1 =
(R + T + S + P)/4 and examine the case in which two players have different values of h and
β. We set h = 0.3 and β = 3 for a focal BM player. For the opponent BM player with the
identical values of h and β, Figs. 1 and 3 indicate that the two players mutually cooperate.
With t_max = 1000 and e = 0.02, the fraction of cooperation and the mean payoff for the
focal player when the opponent has different values of h and β are shown in Fig. 5(a) and
5(b), respectively. Figure 5(a) indicates that the focal BM player mostly cooperates with the
opponent with similar values of h and β. Although the fraction of cooperation is small when
the opponent has small β, the focal BM player avoids being exploited by the opponent in this
way (Fig. 5(b)). In both cases, the focal BM player performs well against the BM opponent.

3.2 BM against reactive strategies 

We examine the behavior of the BM player against players adopting the reactive strategy. 
A reactive strategy is an often used non-learning strategy, and it is specified by two
parameters p and q (p, q ∈ [0, 1]) and the initial condition. The reactive player cooperates
with probabilities p and q when the opponent cooperates and defects in the previous round,
respectively (Nowak & Sigmund, 1989; Nowak & Sigmund, 1992; Nowak, 2006). Unconditional
cooperation (ALLC), unconditional defection (ALLD), and TFT correspond to (p, q) = (1, 1),
(0, 0), and (1, 0), respectively. We assume that a player with the reactive strategy cooperates
in the first round. 
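
The definition of the reactive strategy translates directly into code. The Python sketch below is illustrative; the function name `reactive_action` and the convention of passing `None` for the first round are our own choices, and the implementation error e is not included.

```python
import random


def reactive_action(p, q, opponent_last, rng):
    """Action of a reactive player with parameters (p, q).

    opponent_last is "C" or "D", or None in the first round, in which
    the reactive player is assumed to cooperate (as stated in the text).
    """
    if opponent_last is None:
        return "C"
    prob_c = p if opponent_last == "C" else q
    return "C" if rng.random() < prob_c else "D"


# The three named strategies as (p, q) pairs.
ALLC, ALLD, TFT = (1.0, 1.0), (0.0, 0.0), (1.0, 0.0)

rng = random.Random(0)
print(reactive_action(*TFT, opponent_last="D", rng=rng))   # D: TFT retaliates
print(reactive_action(*ALLC, opponent_last="D", rng=rng))  # C: ALLC always cooperates
```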

The fraction of cooperation of the BM player against various reactive strategies is shown
in Fig. 6(a) for t_max = 1000, e = 0.02, h = 0.3, and β = 3. The BM player rarely cooperates
with ALLC and ALLD. Never cooperating is the optimal behavior against these two strategies.
The BM player cooperates with TFT in approximately half the rounds. This is not an optimal
behavior; perpetual cooperation is optimal when the opponent is TFT (Axelrod, 1984). The
mean payoff for the BM player against the reactive strategy is shown in Fig. 6(d). For all the
values of p and q, the mean aspiration level of the BM player is indistinguishable from the mean
payoff shown in Fig. 6(d). As already implied in Fig. 6(a), the BM player exploits ALLC and
gains more than 4.5 per round. The BM player is not exploited by ALLD and gains approximately
P = 1 per round. The BM player gains approximately 2.5 per round against TFT. This
value is smaller than but not too far from R = 3 per round, which would be obtained by
mutual cooperation with TFT. Figure 6(a) shows that the BM player cooperates with a large
probability with generous tit-for-tat (GTFT) defined by p = 1 and q = 1/3 for the current
payoff matrix. GTFT is a strong competitor in the iterated PD (Nowak & Sigmund, 1992).
Although the BM player occasionally defects against GTFT, the BM player gains ≈ R = 3 per
round, which would be obtained by mutual cooperation.

The BM player does not play optimally against TFT. However, the BM player is generally
strong against reactive strategies, as compared to TFT and GTFT. To support this,
we plot the fraction of cooperation and the mean payoff for TFT against the reactive
strategy in Fig. 6(b) and 6(e), respectively. The plotted values are analytical solutions
obtained by Nowak and Sigmund (Nowak & Sigmund, 1989; Nowak & Sigmund, 1990; Nowak, 1990;
Nowak, 2006), which are summarized in the Appendix for completeness. As shown in Fig. 6(b),
TFT does not cooperate with itself because TFT is intolerant to haphazard defection of the
opponent (Nowak & Sigmund, 1992). In addition, TFT does not exploit ALLC. This is why
TFT is eventually invaded by ALLD in evolutionary simulations in which ALLC, ALLD, and
TFT coexist (Nowak & Sigmund, 1992; Nowak & Sigmund, 1993). The payoff for TFT against
the reactive strategy (Fig. 6(e)) is smaller than that for the BM player (Fig. 6(d)) for a wide
range of p and q. This is particularly true for large values of p, which encompass TFT, GTFT,
and ALLC.

The fraction of cooperation and the mean payoff for the GTFT player against the reactive
strategy are shown in Fig. 6(c) and 6(f), respectively (see the Appendix for derivation). The
GTFT player performs better than the BM player and TFT for large p. However, GTFT is
too generous to ALLC and ALLD. When p and q are both small or both large, the payoff for
GTFT (Fig. 6(f)) is smaller than that for the BM player (Fig. 6(d)). In addition, the payoff
for GTFT and that for the BM player are comparable when q > p and when (p, q) is close to
that of GTFT. On the basis of these numerical results, we conclude that the performance of
the BM player against the reactive strategy is comparable to that of GTFT.

3.3 Evolutionary simulations 

If the BM player is a strong competitor in the iterated PD, it should be able to evolve in a 
population in which different strategies coexist. To examine this point, we simulate evolutionary 
dynamics of populations where BM players and non-learners coexist in the beginning. We model 
non-learners by the stochastic memory-one strategy with which the player determines an action
based on its own action and that of the opponent in the previous round (Nowak et al., 1995).

There are four types of outcomes of the pairwise interaction in a round, i.e., CC, CD, DC,
and DD. The first and second letters (i.e., C or D) represent the actions of the focal player
obeying the memory-one strategy (memory-one player for short) and the opponent, respectively.
The memory-one player is parametrized by the action in the first round and four probabilities
p_CC, p_CD, p_DC, and p_DD. The probability corresponding to the outcome of the present round
is used as the probability that the memory-one player cooperates in the next round. For
example, if both players cooperate, the memory-one player cooperates with probability p_CC in
the next round. Initially, the memory-one player is assumed to cooperate with probability p_CC.
The memory-one strategy includes many important strategies such as the reactive strategy and
Pavlov (Nowak et al., 1995). We assume p_CC, p_CD, p_DC, p_DD ∈ {0, 1/m, 2/m, ..., (m - 1)/m, 1}
and that there are initially an equal number of memory-one players of each type. The case
m = 3 with a slight modification is employed in a previous study (Hauert & Stenull, 2002). To
be realistic, we assume that both the memory-one player and the BM player misimplement the
intended action with probability e = 0.02.

We denote the number of players in the population by N. Each player in the population
plays against each of the other N - 1 players iteratively for t_max = 1000 rounds in a single
generation. The results shown in the following are qualitatively the same if t_max is reduced
to 500. We normalize the payoff of each player by dividing it by (N - 1) t_max such that the
payoff per generation falls between S and T. We update the strategy of the players during
evolutionary dynamics according to the Fermi rule (Szabo & Toke, 1998; Traulsen et al., 2006).
At the end of each generation, we pick a pair of players i and j from the population with
equal probability. We denote their single-generation payoffs as r^(i) and r^(j). With probability

\[ \frac{1}{1 + \exp\left[\beta \left(r^{(i)} - r^{(j)}\right)\right]}, \]

player i copies the strategy of player j. With the remaining probability, player j copies the
strategy of player i. We set β = 1 in the Fermi rule. If the parent (i.e., player whose
strategy is copied) is the BM player, the child (i.e., player copying the strategy of the parent)
becomes the BM learner. In this case, both the parent and the child start with p_1 = 0.5 and
A_1 = (R + T + S + P)/4 in the next generation. If the parent is a memory-one player, the
child inherits the parent's parameter values p_CC, p_CD, p_DC, and p_DD. For simplicity, we do not
consider mutations.
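
The pairwise update described above can be sketched in a few lines of Python. This is an illustration under the stated rule, not the simulation code behind the figures; the names `copy_probability` and `fermi_update` are hypothetical, and resetting the learning state of BM players at the start of the next generation is left out.

```python
import math
import random


def copy_probability(r_i, r_j, beta=1.0):
    """Probability that player i copies player j under the Fermi rule."""
    return 1.0 / (1.0 + math.exp(beta * (r_i - r_j)))


def fermi_update(strategies, payoffs, rng, beta=1.0):
    """Pick two distinct players uniformly at random; one copies the other.

    strategies is modified in place. Returns the indices (i, j) of the pair.
    """
    i, j = rng.sample(range(len(strategies)), 2)
    if rng.random() < copy_probability(payoffs[i], payoffs[j], beta):
        strategies[i] = strategies[j]   # i copies j
    else:
        strategies[j] = strategies[i]   # j copies i
    return i, j


# Toy population: per-round payoffs are normalized to lie between S and T.
strategies = ["BM", "ALLD", "TFT", "BM"]
payoffs = [2.8, 1.1, 2.0, 2.7]
fermi_update(strategies, payoffs, random.Random(0))
print(strategies)
```

Because the payoffs are normalized to lie between S and T, the exponent stays small and a player with a lower payoff is still copied with non-negligible probability when β = 1.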

To examine the possibility that the BM player invades a population of players with various
memory-one strategies, we start evolutionary simulations with BM players making up
approximately 1% of the population. Two time courses of typical runs when h = 0.3 and β = 3
for the BM players are shown in Fig. 7. In Fig. 7(a), we set m = 3 and prepare 10 memory-one
players of each of the 4^4 = 256 types and 25 BM players in the beginning. Therefore,
N = 2560 + 25 = 2585. In Fig. 7(b), we set m = 5 and prepare two memory-one players of each
of the 6^4 = 1296 types and 25 BM players in the beginning. Therefore, N = 2592 + 25 = 2617.
In both cases, the BM players can invade the population of memory-one players to eventually
become dominant. Within the memory-one players, those with large p_CC tend to survive at
early stages of evolutionary dynamics before they are overwhelmed by the BM player.
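
The population sizes quoted above follow from enumerating the grid of memory-one strategies. The Python sketch below reproduces the counts; the function name `memory_one_types` is hypothetical, and exact fractions are used only to avoid floating-point grid values.

```python
from fractions import Fraction
from itertools import product


def memory_one_types(m):
    """All (p_CC, p_CD, p_DC, p_DD) with each entry in {0, 1/m, ..., 1}."""
    grid = [Fraction(k, m) for k in range(m + 1)]
    return list(product(grid, repeat=4))


N_BM = 25  # number of BM players in the initial population
for m, copies in [(3, 10), (5, 2)]:
    n_types = len(memory_one_types(m))
    N = copies * n_types + N_BM
    print(m, n_types, N)   # 3 256 2585, then 5 1296 2617
```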



4 Discussion 



We numerically analyzed the behavior of a BM model in the iterated PD. Our model is a 
modification of the BM model used by Macy and Flache (2002) such that the probability of 
cooperation in the next round is made sensitive to the reinforcement signal obtained in the 
current round. Our model is also a close variant of the models used by Karandikar et al. 
(1998) and Posch et al. (1999). When the adaptation of the aspiration level is not too fast, the 
modified BM player mutually cooperates with a large probability. The BM player also performs 
efficiently against reactive strategies and in evolutionary dynamics in a population comprising 
various memory-one strategies. Within the range of our numerical experiments, the results are
robust against the error in the action implementation and the change in the payoff matrix
describing the PD.

The BM player performs at least comparably to memory-one players such as GTFT and
Pavlov, which are strong competitors in the iterated PD. Although the BM player is
inferior to these strategies when playing against TFT, it performs better than GTFT against
other strategies including ALLC and ALLD. In an evolutionary context, naively cooperating
with ALLC allows ALLC to prosper by neutral drift, which eventually invites the
invasion of malicious players such as ALLD. Therefore, it is important to be able to exploit
ALLC for a strategy to survive in evolutionary dynamics (Nowak & Sigmund, 1993). This
property is not satisfied by TFT (Axelrod, 1984), GTFT (Nowak & Sigmund, 1992), Pavlov
(Kraines & Kraines, 1989; Nowak & Sigmund, 1993), and the BM model with a fixed aspiration
level (Macy, 1991; Macy, 1996; Posch et al., 1999; Macy & Flache, 2002). In contrast, our BM
player as well as the temporal difference learner (Masuda & Ohtsuki, 2009) is capable of
exploiting ALLC.

As a model of humans and other animals in iterated games, a learning strategy may
be generally disadvantageous as compared to simpler learning strategies and non-learning
strategies in at least two aspects. First, the number of rounds before learning is
established may be large. For example, temporal difference learning, a type of reinforcement
learning, must be implemented with a very small learning rate to realize mutual cooperation
(Masuda & Ohtsuki, 2009). In contrast, the learning of the modified BM model is completed in
one to several hundred rounds. This speed of learning is comparable to that of other learning
models in which mutual cooperation is obtained within ten to a hundred rounds (Macy, 1991;
Macy, 1996; Erev & Roth, 1998; Erev & Roth, 2001; Hauert & Stenull, 2002; Macy & Flache, 2002).
Second, humans or other animals subjected to social dilemma situations may not implement
a complex learning strategy. In this aspect, the BM model with the adaptive aspiration
level, both the original ones (Karandikar et al., 1998; Posch et al., 1999; Macy & Flache, 2002)
and ours, has a clear advantage. The BM model is simpler than many learning models
including temporal difference learning (Sandholm & Crites, 1996; Masuda & Ohtsuki, 2009),
fictitious play (Erev & Roth, 1998; Fudenberg & Levine, 1998; Camerer, 2003), genetic
algorithms (Macy, 1996), and artificial neural networks (Macy, 1996; Sandholm & Crites, 1996;
Taiji & Ikegami, 1999; Gutnisky & Zanutto, 2004).



The memory-one strategy, for example, can be regarded as a reinforcement learning because 
the probability of cooperation is a function of the outcome in the previous round. This is also 



the case for analogous strategies with longer memory ( Lindgren, 1991 ). Nevertheless, in this 
study, we are concerned with the cases where the probabilities of cooperation conditioned by the 
recent results of the game adapt over time. In the case of the BM model, adaptation is realized 
by the dynamic aspiration level. Learning players in this restricted sense cope with various 
types of opponents more fiexibly than the memory-one strategy or its extension with longer 
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memory. For example, we showed that the mean aspiration level of the BM player is almost 
equal to the mean payoff against different reactive strategies (Sec. 13. 2p . This result indicates 
that the BM player flexibly behaves as different types of win-stay lose-shift strategists depending 
on the opponent. Learning in games is a recent outstanding issue involving interdisciplinary 
research fields such as behavioral game theory and neuroeconomics (Fudenberg & Levine, 1998;
Camerer, 2003; Glimcher et al., 2009). Because the effect of learning is evident in laboratory
experiments (Camerer, 2003; Glimcher et al., 2009), it may be important to consider individual
learning in addition to evolution to understand the behavior of agents, particularly that of
humans, in social dilemma situations. Our model, which is simple yet competitive in the PD, 
may be used for examining various problems with regard to relationships between learning and 
cooperation in social dilemma situations.

Appendix: Reactive strategy against itself

When the focal player obeys the reactive strategy with parameters p and q, the long-term
behavior of the focal player against the reactive strategy with parameters p' and q' can be
analytically calculated (Nowak & Sigmund, 1989; Nowak & Sigmund, 1990; Nowak, 1990; Nowak, 2006).
The probability that the focal player cooperates is given by

\[ s_1 = \frac{q'(p - q) + q}{1 - (p - q)(p' - q')}. \tag{5} \]

The mean payoff of the focal player is given by

\[ R s_1 s_2 + S s_1 (1 - s_2) + T (1 - s_1) s_2 + P (1 - s_1)(1 - s_2), \tag{6} \]

where s_1 is given by Eq. (5) and

\[ s_2 = \frac{q(p' - q') + q'}{1 - (p - q)(p' - q')}. \tag{7} \]
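The stationary cooperation probabilities and the mean payoff given above can be evaluated numerically. The following Python sketch is an illustration only and is not taken from the original study. The function names cooperation_levels and mean_payoff are hypothetical, the opponent's parameters are named p_opp and q_opp, and the default payoff values (R = 3, T = 5, S = 0, P = 1) are those quoted in the figure captions. The expressions are assumed to hold only when the denominator is nonzero, so the sketch raises an error otherwise (for example, when both players use p = 1 and q = 0).

```python
# Illustrative sketch (hypothetical helper names, not the authors' code).

def cooperation_levels(p, q, p_opp, q_opp):
    """Return (s1, s2): long-run cooperation probabilities of the focal
    player (p, q) and of the opponent (p_opp, q_opp), both reactive."""
    denom = 1.0 - (p - q) * (p_opp - q_opp)
    if abs(denom) < 1e-12:
        # Assumption: the closed form is undefined when the denominator vanishes.
        raise ValueError("denominator is zero; stationary state is not unique")
    s1 = (q_opp * (p - q) + q) / denom
    s2 = (q * (p_opp - q_opp) + q_opp) / denom
    return s1, s2


def mean_payoff(s1, s2, R=3.0, T=5.0, S=0.0, P=1.0):
    """Mean payoff of the focal player given the cooperation probabilities."""
    return (R * s1 * s2
            + S * s1 * (1.0 - s2)
            + T * (1.0 - s1) * s2
            + P * (1.0 - s1) * (1.0 - s2))


if __name__ == "__main__":
    # Example parameter values chosen arbitrarily for illustration.
    s1, s2 = cooperation_levels(0.9, 0.1, 0.8, 0.3)
    print("s1 =", s1, "s2 =", s2, "payoff =", mean_payoff(s1, s2))
```

A quick consistency check is that the returned values satisfy s1 = q + (p - q) s2 and s2 = q_opp + (p_opp - q_opp) s1, which is the pair of linear relations the closed-form expressions solve.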



References 

[Axelrod, 1984] Axelrod, R. 1984. Evolution of Cooperation. Basic Books, NY. 

[Bendor et al., 2003] Bendor, J., Diermeier, D. & Ting, M. 2003. A behavioral model of
turnout. Am. Political Sci. Rev. 97 (2), 261-280. 

[Borgers & Sarin, 2000] Borgers, T. & Sarin, R. 2000. Naive reinforcement learning with en- 
dogenous aspirations. Int. Econ. Rev. 41 (4), 921-950. 

[Boyd & Richerson, 1985] Boyd, R. & Richerson, P. J. 1985. Culture and the Evolutionary 
Process. The University of Chicago Press, Chicago. 

[Camerer, 2003] Camerer, C. F. 2003. Behavioral Game Theory. Princeton University Press,
NJ.

[Cho & Matsui, 2005] Cho, I. K. & Matsui, A. 2005. Learning aspiration in repeated games.
J. Econ. Theory, 124 (2), 171-201.

[Cross, 1983] Cross, J. G. 1983. A Theory of Adaptive Economic Behavior. Cambridge
University Press, Cambridge.

[Daw & Doya, 2006] Daw, N. D. & Doya, K. 2006. The computational neurobiology of learning
and reward. Curr. Opin. Neurobiol. 16, 199-204.

[Dixon, 2000] Dixon, H. D. 2000. Keeping up with the Joneses: competition and the evolution 
of collusion. J. Econ. Behav. Organ. 43 (2), 223-238. 

[Erev & Roth, 1998] Erev, I. & Roth, A. E. 1998. Predicting how people play games: reinforce- 
ment learning in experimental games with unique, mixed strategy equilibria. Amer. Econ. 
Rev. 88, 848-881. 

[Erev & Roth, 2001] Erev, I. & Roth, A. E. 2001. Simple reinforcement learning models and 
reciprocation in the prisoner's dilemma game. In: Gigerenzer, G. and Selten, R. (Eds.), The 
Adaptive Toolbox, pp. 215-231. 

[Fudenberg & Levine, 1998] Fudenberg, D. & Levine, D. K. 1998. The Theory of Learning in 
Games. MIT Press, MA. 

[Glimcher et al., 2009] Glimcher, P. W., Camerer, C. F., Fehr, E. & Poldrack, R. A. 2009.
Neuroeconomics - Decision Making and the Brain. Academic Press, London. 

[Gutnisky & Zanutto, 2004] Gutnisky, D. A. & Zanutto, B. S. 2004. Cooperation in the iterated 
prisoner's dilemma is learned by operant conditioning mechanisms. Artif. Life, 10 (4), 433- 
461. 

[Hauert & Stenull, 2002] Hauert, C. & Stenull, O. 2002. Simple adaptive strategy wins the
prisoner's dilemma. J. Theor. Biol. 218 (3), 261-272. 

[Izquierdo et al., 2007] Izquierdo, L. R., Izquierdo, S. S., Gotts, N. M. & Polhill, J. G. 2007.
Transient and asymptotic dynamics of reinforcement learning in games. Games Econ. Behav.
61 (2), 259-276.

[Izquierdo et al., 2008] Izquierdo, S. S., Izquierdo, L. R. & Gotts, N. M. 2008. Reinforcement
learning dynamics in social dilemmas. J. Artif. Soc. Soc. Simul. 11 (2), 1.

[Karandikar et al., 1998] Karandikar, R., Mookherjee, D., Ray, D. & Vega-Redondo, F. 1998.
Evolving aspirations and cooperation. J. Econ. Theory, 80 (2), 292-331. 

[Kim, 1999] Kim, Y. 1999. Satisficing and optimality in 2 x 2 common interest games. Econ. 
Theory, 13 (2), 365-375. 

[Kraines & Kraines, 1989] Kraines, D. & Kraines, V. 1989. Pavlov and the prisoner's dilemma. 
Theory Decis. 26 (1), 47-79. 

[Lindgren, 1991] Lindgren, K. 1991. Evolutionary phenomena in simple dynamics. Proceedings 
of Artificial Life II, pp. 295-312. 

[Macy, 1996] Macy, M. 1996. Natural selection and social learning in prisoner's dilemma. Sociol. 
Methods Res. 25 (1), 103-137. 

[Macy, 1991] Macy, M. W. 1991. Learning to cooperate: stochastic and tacit collusion in social 
exchange. Am. J. Sociol. 97 (3), 808-843. 

[Macy & Flache, 2002] Macy, M. W. & Flache, A. 2002. Learning dynamics in social dilemmas. 
Proc. Natl. Acad. Sci. USA, 99, 7229-7236. 

[Masuda & Ohtsuki, 2009] Masuda, N. & Ohtsuki, H. 2009. A theoretical analysis of temporal 
difference learning in the iterated prisoner's dilemma game. Bull. Math. Biol. 71, 1818-1850. 

[Montague & Berns, 2002] Montague, P. R. & Berns, G. S. 2002. Neural economics and the 
biological substrates of valuation. Neuron, 36, 265-284. 

[Napel, 2003] Napel, S. 2003. Aspiration adaptation in the ultimatum minigame. Games 
Econom. Behav. 43 (1), 86-106. 



[Nowak, 1990] Nowak, M. 1990. Stochastic strategies in the prisoner's dilemma. Theor. Popul. 
Biol. 38, 93-112. 

[Nowak & Sigmund, 1989] Nowak, M. & Sigmund, K. 1989. Game-dynamical aspects of the 
prisoner's dilemma. Applied Math. Comput. 30, 191-213. 

[Nowak & Sigmund, 1990] Nowak, M. & Sigmund, K. 1990. The evolution of stochastic strate- 
gies in the prisoner's dilemma. Acta Applicandae Math. 20, 247-265. 

[Nowak & Sigmund, 1993] Nowak, M. & Sigmund, K. 1993. A strategy of win-stay, lose-shift 
that outperforms tit-for-tat in the prisoner's dilemma game. Nature, 364, 56-58. 

[Nowak, 2006] Nowak, M. A. 2006. Evolutionary Dynamics. The Belknap Press of Harvard 
University Press, MA. 

[Nowak & Sigmund, 1992] Nowak, M. A. & Sigmund, K. 1992. Tit for tat in heterogeneous 
populations. Nature, 355, 250-253. 

[Nowak et al., 1995] Nowak, M. A., Sigmund, K. & El-Sedy, E. 1995. Automata, repeated
games and noise. J. Math. Biol. 33 (7), 703-722. 

[Oechssler, 2002] Oechssler, J. 2002. Cooperation as a result of learning with aspiration levels. 
J. Econom. Behav. Organ. 49 (3), 405-409. 

[Palomino & Vega-Redondo, 1999] Palomino, F. & Vega-Redondo, F. 1999. Convergence of 
aspirations and (partial) cooperation in the prisoner's dilemma. Int. J. Game Theory, 28 (4), 
465-488. 

[Pazgal, 1997] Pazgal, A. 1997. Satisficing leads to cooperation in mutual interests games. Int. 
J. Game Theory, 26 (4), 439-453. 



[Posch et al., 1999] Posch, M., Pichler, A. & Sigmund, K. 1999. The efficiency of adapting
aspiration levels. Proc. R. Soc. Lond. B, 266 (1427), 1427-1435. 

[Rapoport & Chammah, 1965] Rapoport, A. & Chammah, A. M. 1965. Prisoner's Dilemma: 
A Study in Conflict and Cooperation. Michigan University Press, Ann Arbor, MI. 

[Sandholm & Crites, 1996] Sandholm, T. W. & Crites, R. H. 1996. Multiagent reinforcement 
learning in the iterated prisoner's dilemma. Biosystems, 37 (1-2), 147-166. 

[Schultz et al., 1997] Schultz, W., Dayan, P. & Montague, P. R. 1997. A neural substrate of
prediction and reward. Science, 275, 1593-1599. 

[Sigmund, 2010] Sigmund, K. 2010. The Calculus of Selfishness. Princeton University Press, 
Princeton, NJ. 

[Simon, 1959] Simon, H. A. 1959. Theories of decision-making in economics and behavioral 
science. Amer. Econ. Rev. 49 (3), 253-283. 

[Szabo & Toke, 1998] Szabo, G. & Toke, C. 1998. Evolutionary prisoner's dilemma game on a 
square lattice. Phys. Rev. E, 58 (1), 69-73. 

[Taiji & Ikegami, 1999] Taiji, M. & Ikegami, T. 1999. Dynamics of internal models in game 
players. Physica D, 134 (2), 253-266. 

[Traulsen et al., 2006] Traulsen, A., Nowak, M. A. & Pacheco, J. M. 2006. Stochastic dynamics
of invasion and fixation. Phys. Rev. E, 74 (1), 011909.

[Trivers, 1971] Trivers, R. L. 1971. The evolution of reciprocal altruism. Q. Rev. Biol. 46, 
35-57. 



[Figure 1 image omitted: panels (a), (b), (c); horizontal axis h, from 0.0 to 1.0.]

Figure 1: Fraction of cooperation of the BM player playing against another BM player. The
number of rounds is equal to (a-c) t_max = 1000 and (d-f) t_max = 100. The probability of the
misimplementation of the action is equal to (a, d) ε = 0, (b, e) ε = 0.01, and (c, f) ε = 0.1. We
set R = 3, T = 5, S = 0, and P = 1, and vary h and β.

Figure 2: Fraction of cooperation of the BM player playing against another BM player in the
Macy-Flache model. We set t_max = 1000, ε = 0, R = 3, T = 5, S = 0, and P = 1, and vary h
and l.

Figure 3: Fraction of cooperation of the BM player playing against another BM player. We
set t_max = 1000, ε = 0.02, β = 3, c = 1, R = b - c, T = b, S = -c, and P = 0, and vary h and
b/c.

Figure 4: Fraction of cooperation of the BM player playing against another BM player. We
set t_max = 1000, ε = 0.02, β = 3, R = 3, T = 5, S = 0, and P = 1, and vary h and A_1.

Figure 5: Behavior of a focal BM player with h = 0.3 and β = 3 against a BM opponent with
different values of h and β. (a) Fraction of cooperation and (b) mean payoff of the focal BM
player. We set t_max = 1000, ε = 0.02, R = 3, S = 0, T = 5, and P = 1.

Figure 6: Fraction of cooperation of (a) BM player, (b) TFT, and (c) GTFT against reactive
strategies. We set t_max = 1000, ε = 0.02, h = 0.3, β = 3, R = 3, S = 0, T = 5, and P = 1.
Mean payoff of (d) BM player, (e) TFT, and (f) GTFT against reactive strategies.

Figure 7: Time courses of evolutionary dynamics. We set t_max = 1000, ε = 0.02, h = 0.3,
β = 3, R = 3, T = 5, S = 0, and P = 1. (a) Results for = 2585, m = 3, and 4^4 types
of memory-one strategies. (b) Results for = 2617, m = 5, and 6^4 types of memory-one
strategies.